In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
sns.set_style('white')
%matplotlib inline
In the IMDb dataset, we had two dimensions (number of votes and rating). What if we have high-dimensional data? First, in many cases, the number of dimensions is not too large. For instance, the "Iris" dataset contains four measurements taken from three species of iris flowers. That's more than two dimensions, yet still manageable.
This dataset is also included in seaborn, so we can load it.
In [2]:
iris = sns.load_dataset('iris')
iris.head()
Out[2]:
We get four dimensions (sepal_length, sepal_width, petal_length, petal_width). One direct way to visualize them is to draw a scatter plot for each pair of dimensions. We can use the pairplot() function in seaborn to do this.
Try the following code. What do you see?
In [3]:
sns.pairplot(iris)
Out[3]:
We can also color the symbols based on species:
In [4]:
sns.pairplot(iris, hue='species')
Out[4]:
The colors represent the three iris species, so for each pair of dimensions we can tell from the colors whether its scatter plot separates the species clearly or not. Which pair of dimensions do you think best separates the species?
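If you want to zoom in on a single candidate pair, here is a minimal sketch (assuming a reasonably recent seaborn, 0.9+, for the scatterplot() function; petal_length vs. petal_width is just one plausible choice, so check it against your own reading of the pairplot):
In [ ]:
# Zoom in on one candidate pair; swap in whichever pair you think
# separates the species best in the pairplot above.
sns.scatterplot(data=iris, x='petal_length', y='petal_width', hue='species')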
Principal component analysis (PCA) is a widely used dimensionality reduction method. The goal of dimensionality reduction is, of course, to reduce the number of variables (dimensions, measurements, columns).
For example, in the Iris dataset we have four variables (sepal_length, sepal_width, petal_length, petal_width). If we can reduce the number of variables to two, then we can easily visualize them. PCA offers one way to do this.
PCA is implemented in the scikit-learn package, a machine learning library for Python that should already be included in Anaconda. If not, install scikit-learn by running:
conda install scikit-learn
or
pip install scikit-learn
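To confirm the installation, you can import the package and print its version (a quick sanity check, nothing more):
In [ ]:
# Sanity check: this should print a version string such as 1.x.y
import sklearn
print(sklearn.__version__)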
Before running PCA, we need to transform iris from a DataFrame into a NumPy array. DataFrame.values returns the NumPy representation of the DataFrame.
Extract the four variables as X and the species as Y:
In [5]:
X = iris.values[:, 0:4] # extract the 1st to the 4th columns (the four measurements) of all rows
Y = iris.values[:, 4] # extract the 5th column (species) of all rows
# print(X)
# print(Y)
We can now perform PCA with the following code:
In [6]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # set the number of components to 2
X_r = pca.fit(X).transform(X)
#Make a dataframe with the results
df = pd.DataFrame(X_r, columns=['PC1', 'PC2'])
df['species'] = Y
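As a side note, the fit-then-transform pattern above can be written in one call with fit_transform(), which produces the same result for PCA:
In [ ]:
# Equivalent one-call shorthand for pca.fit(X).transform(X)
X_r = pca.fit_transform(X)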
Now we only have two dimensions. We can plot them again with the previous code:
In [7]:
sns.pairplot(df, hue='species')
Out[7]:
Compare with the previous plot. What do you think PCA was doing? How did it reduce dimensionality to 2?
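One hint: PCA projects the data onto the directions of largest variance, and the fitted object reports how much of the total variance each component captures. A quick check on the pca object fitted above:
In [ ]:
# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
If the first two components capture most of the variance (for the iris data they should capture well over 90%), then the two-dimensional picture preserves most of the spread in the original four dimensions.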
t-SNE (t-Distributed Stochastic Neighbor Embedding) is another tool for visualizing high-dimensional data. The technique has become widespread in machine learning because of its almost magical ability to create compelling two-dimensional “maps” from data with hundreds or even thousands of dimensions.
Let's try it out with the iris data.
In [8]:
from sklearn.manifold import TSNE
In [9]:
from sklearn.datasets import load_iris
iris = load_iris()  # note: this replaces the seaborn DataFrame with sklearn's Bunch object (iris.data, iris.target)
X_tsne = TSNE(learning_rate=100, perplexity=30).fit_transform(iris.data)
In [10]:
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='Set1', s=30)
Out[10]:
The hyperparameter perplexity determines how to balance attention between local and global aspects of your data. Changing this parameter (the default is 30) can cause drastic changes in the output:
In [11]:
X_tsne = TSNE(learning_rate=100, perplexity=10).fit_transform(iris.data)
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='Set1', s=30)
Out[11]:
Experiment with a few different perplexity values. How do you think it influences the result?
In [12]:
#TODO: put your experiments and answers here.
for i in range(5, 150, 10):  # perplexity must be positive and less than the number of samples
    X_tsne = TSNE(learning_rate=100, perplexity=i).fit_transform(iris.data)
    plt.figure(figsize=(10, 5))
    plt.subplot(121)
    plt.title("Perplexity value: " + str(i))
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='Set1', s=30)
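One caveat when experimenting: t-SNE is stochastic, so two runs with identical parameters can produce different maps. If you want reproducible output, fix the random seed (random_state=42 below is an arbitrary choice):
In [ ]:
# Fixing random_state makes repeated runs reproducible, so differences
# between plots can be attributed to perplexity rather than randomness.
X_tsne = TSNE(learning_rate=100, perplexity=30, random_state=42).fit_transform(iris.data)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='Set1', s=30)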